{
 "metadata": {
  "name": "",
  "signature": "sha256:14292a5ce75b32480c1c9a9339628a7e801795cb6ec36e4a5166b8048fe05da1"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Dimensionality Reduction with the Shogun Machine Learning Toolbox"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "*By Sergey Lisitsyn ([lisitsyn](https://github.com/lisitsyn)) and Fernando J. Iglesias Garcia ([iglesias](https://github.com/iglesias)).*"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This notebook illustrates <a href=\"http://en.wikipedia.org/wiki/Unsupervised_learning\">unsupervised learning</a> using the suite of dimensionality reduction algorithms available in Shogun. Shogun provides access to all these algorithms using [Tapkee](http://tapkee.lisitsyn.me/), a C++ library especialized in <a href=\"http://en.wikipedia.org/wiki/Dimensionality_reduction\">dimensionality reduction</a>."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Hands-on introduction to dimension reduction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First of all, let us start right away by showing what the purpose of dimensionality reduction actually is. To this end, we will begin by creating a function that provides us with some data:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy\n",
      "import os\nSHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')\n",
      "\n",
      "def generate_data(curve_type, num_points=1000):\n",
      "\tif curve_type=='swissroll':\n",
      "\t\ttt = numpy.array((3*numpy.pi/2)*(1+2*numpy.random.rand(num_points)))\n",
      "\t\theight = numpy.array((numpy.random.rand(num_points)-0.5))\n",
      "\t\tX = numpy.array([tt*numpy.cos(tt), 10*height, tt*numpy.sin(tt)])\n",
      "\t\treturn X,tt\n",
      "\tif curve_type=='scurve':\n",
      "\t\ttt = numpy.array((3*numpy.pi*(numpy.random.rand(num_points)-0.5)))\n",
      "\t\theight = numpy.array((numpy.random.rand(num_points)-0.5))\n",
      "\t\tX = numpy.array([numpy.sin(tt), 10*height, numpy.sign(tt)*(numpy.cos(tt)-1)])\n",
      "\t\treturn X,tt\n",
      "\tif curve_type=='helix':\n",
      "\t\ttt = numpy.linspace(1, num_points, num_points).T / num_points\n",
      "\t\ttt = tt*2*numpy.pi\n",
      "\t\tX = numpy.r_[[(2+numpy.cos(8*tt))*numpy.cos(tt)],\n",
      "\t\t             [(2+numpy.cos(8*tt))*numpy.sin(tt)],\n",
      "\t\t             [numpy.sin(8*tt)]]\n",
      "\t\treturn X,tt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The function above can be used to generate three-dimensional datasets with the shape of a [Swiss roll](http://en.wikipedia.org/wiki/Swiss_roll), the letter S, or an helix. These are three examples of datasets which have been extensively used to compare different dimension reduction algorithms. As an illustrative exercise of what dimensionality reduction can do, we will use a few of the algorithms available in Shogun to embed this data into a two-dimensional space. This is essentially the dimension reduction process as we reduce the number of features from 3 to 2. The question that arises is: what principle should we use to keep some important relations between datapoints? In fact, different algorithms imply different criteria to answer this question."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Just to start, lets pick some algorithm and one of the data sets, for example lets see what embedding of the Swissroll is produced by the Isomap algorithm. The Isomap algorithm is basically a slightly modified Multidimensional Scaling (MDS) algorithm which finds embedding as a solution of the following optimization problem:\n",
      "\n",
      "$$\n",
      "\\min_{x'_1, x'_2, \\dots} \\sum_i \\sum_j \\| d'(x'_i, x'_j) - d(x_i, x_j)\\|^2,\n",
      "$$\n",
      "\n",
      "with defined $x_1, x_2, \\dots \\in X~~$ and unknown variables $x_1, x_2, \\dots \\in X'~~$ while $\\text{dim}(X') < \\text{dim}(X)~~~$,\n",
      "$d: X \\times X \\to \\mathbb{R}~~$ and $d': X' \\times X' \\to \\mathbb{R}~~$ are defined as arbitrary distance functions (for example Euclidean). \n",
      "\n",
      "Speaking less math, the MDS algorithm finds an embedding that preserves pairwise distances between points as much as it is possible. The Isomap algorithm changes quite small detail: the distance - instead of using local pairwise relationships it takes global factor into the account with shortest path on the neighborhood graph (so-called geodesic distance). The neighborhood graph is defined as graph with datapoints as nodes and weighted edges (with weight equal to the distance between points). The edge between point $x_i~$ and $x_j~$ exists if and only if $x_j~$ is in $k~$ nearest neighbors of $x_i$. Later we will see that that 'global factor' changes the game for the swissroll dataset.\n",
      "\n",
      "However, first we prepare a small function to plot any of the original data sets together with its embedding."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%matplotlib inline\n",
      "import matplotlib.pyplot as plt\n",
      "from mpl_toolkits.mplot3d import Axes3D\n",
      "\n",
      "%matplotlib inline\n",
      "\n",
      "def plot(data, embedded_data, colors='m'):\n",
      "\tfig = plt.figure()\n",
      "\tfig.set_facecolor('white')\n",
      "\tax = fig.add_subplot(121,projection='3d')\n",
      "\tax.scatter(data[0],data[1],data[2],c=colors,cmap=plt.cm.Spectral)\n",
      "\tplt.axis('tight'); plt.axis('off')\n",
      "\tax = fig.add_subplot(122)\n",
      "\tax.scatter(embedded_data[0],embedded_data[1],c=colors,cmap=plt.cm.Spectral)\n",
      "\tplt.axis('tight'); plt.axis('off')\n",
      "\tplt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import Isomap, RealFeatures, MultidimensionalScaling\n",
      "\n",
      "# wrap data into Shogun features\n",
      "data, colors = generate_data('swissroll')\n",
      "features = RealFeatures(data)\n",
      "\n",
      "# create instance of Isomap converter and configure it\n",
      "isomap = Isomap()\n",
      "isomap.set_target_dim(2)\n",
      "# set the number of neighbours used in kNN search\n",
      "isomap.set_k(20)\n",
      "\n",
      "# create instance of Multidimensional Scaling converter and configure it\n",
      "mds = MultidimensionalScaling()\n",
      "mds.set_target_dim(2)\n",
      "\n",
      "# embed Swiss roll data\n",
      "embedded_data_mds = mds.embed(features).get_feature_matrix()\n",
      "embedded_data_isomap = isomap.embed(features).get_feature_matrix()\n",
      "\n",
      "plot(data, embedded_data_mds, colors)\n",
      "plot(data, embedded_data_isomap, colors)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As it can be seen from the figure above, Isomap has been able to \"unroll\" the data, reducing its dimension from three to two. At the same time, points with similar colours in the input space are close to points with similar colours in the output space. This is, a new representation of the data has been obtained; this new representation maintains the properties of the original data, while it reduces the amount of information required to represent it. Note that the fact the embedding of the Swiss roll looks good in two dimensions stems from the *intrinsic* dimension of the input data. Although the original data is in a three-dimensional space, its intrinsic dimension is lower, since the only degree of freedom are the polar angle and distance from the centre, or height. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, we use yet another method, Stochastic Proximity Embedding (SPE) to embed the helix:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import StochasticProximityEmbedding\n",
      "\n",
      "# wrap data into Shogun features\n",
      "data, colors = generate_data('helix')\n",
      "features = RealFeatures(data)\n",
      "\n",
      "# create MDS instance\n",
      "converter = StochasticProximityEmbedding()\n",
      "converter.set_target_dim(2)\n",
      "\n",
      "# embed helix data\n",
      "embedded_features = converter.embed(features)\n",
      "embedded_data = embedded_features.get_feature_matrix()\n",
      "\n",
      "plot(data, embedded_data, colors)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "References"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- Lisitsyn, S., Widmer, C., Iglesias Garcia, F. J. Tapkee: An Efficient Dimension Reduction Library. ([Link to paper in JMLR](http://jmlr.org/papers/v14/lisitsyn13a.html#!).)\n",
      "- Tenenbaum, J. B., de Silva, V. and Langford, J. B. A Global Geometric Framework for Nonlinear Dimensionality Reduction. ([Link to Isomap's website](http://isomap.stanford.edu/).)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
